Planning Ability



GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

Neural Information Processing Systems

Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question as to what extent and in which way these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through given objectives, with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieved the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three above criteria.


Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions

Wu, Zirui, Liu, Xiao, Li, Jiayi, Kong, Lingpeng, Feng, Yansong

arXiv.org Artificial Intelligence

While Large Language Model-based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single-task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real-world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints, i.e., specific actions must be performed within a particular time interval after the preceding step. Overly aggressive local parallelization may violate these constraints, potentially compromising the entire cooking process. Such strict time constraints between actions pose a unique challenge: agents must balance maximizing concurrent operations against adhering to critical timing deadlines. Extensive experiments with state-of-the-art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We open-source our benchmark and code at https://github.com/WilliamZR/Recipe2Plan.
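
To make the constraint concrete, here is a minimal sketch of the kind of temporal-feasibility check the abstract describes: each step may start only within a bounded window after its predecessor finishes, while recipes run in parallel. The Step structure and all names are hypothetical illustrations, not the benchmark's actual API.

```python
# Hypothetical sketch of a Recipe2Plan-style feasibility check (not the
# benchmark's real code): a step must start after its predecessor ends,
# but no later than the predecessor's end plus an allowed gap.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    start: float      # scheduled start time (minutes)
    duration: float   # how long the action takes (minutes)
    max_gap: float    # latest allowed delay after the previous step ends

def feasible(plan: list[Step]) -> bool:
    """Check ordering and the max-gap time constraint between consecutive steps."""
    for prev, cur in zip(plan, plan[1:]):
        prev_end = prev.start + prev.duration
        if cur.start < prev_end:                  # overlaps the step it depends on
            return False
        if cur.start > prev_end + cur.max_gap:    # waited too long (e.g. food burns)
            return False
    return True

def makespan(plans: list[list[Step]]) -> float:
    """Total cooking time across recipes executed in parallel."""
    return max(s.start + s.duration for plan in plans for s in plan)

# Example: the flip must happen within 1 minute of the first side finishing.
pancake = [Step("pour batter", 0.0, 2.0, 0.0),
           Step("flip", 2.5, 2.0, 1.0),
           Step("plate", 4.5, 1.0, 5.0)]
print(feasible(pancake), makespan([pancake]))  # True 5.5
```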


On the Planning Abilities of Large Language Models - A Critical Investigation

Neural Information Processing Systems

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs as a source of heuristic guidance for other agents (AI planners) in their planning tasks. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of 12% across the domains. In the heuristic mode, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.
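
The back-prompting setup the abstract mentions can be pictured as a short generate-verify-retry loop: a sound external verifier checks each candidate plan and its error report is appended to the next prompt. This is only a sketch under that reading; `query_llm` and `verify_plan` are hypothetical stand-ins, not the authors' code.

```python
# Hypothetical sketch of verifier-guided back-prompting (not the paper's
# implementation): generate a plan, validate it externally, and feed the
# verifier's feedback back into the prompt for another attempt.
def plan_with_backprompts(problem: str, query_llm, verify_plan, max_rounds: int = 3):
    prompt = f"Produce a plan for the following problem:\n{problem}"
    for _ in range(max_rounds):
        plan = query_llm(prompt)
        ok, errors = verify_plan(problem, plan)   # sound, external check (e.g. VAL-style)
        if ok:
            return plan
        # Back-prompt: show the failed plan and the verifier's error report.
        prompt += (f"\nYour previous plan was invalid:\n{plan}"
                   f"\nErrors: {errors}\nProduce a corrected plan.")
    return None  # no executable plan found within the budget
```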


Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

Neural Information Processing Systems

Recent studies claim emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs; the protocol can be applied to evaluate a range of abilities. Second, we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, …).
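
The evaluation structure CogEval argues for, multiple tasks crossed with control conditions and repeated runs whose raw scores feed statistical tests, can be outlined roughly as below. This is a schematic reading of the protocol, not the authors' harness; `ask_model` and the task/condition lists are placeholders.

```python
# Rough sketch of a CogEval-style evaluation loop (assumed structure, not the
# official protocol code): repeat each task under each control condition and
# keep raw per-run scores so robustness tests can be applied afterwards.
from statistics import mean

def evaluate(ask_model, tasks, conditions, n_repeats=30):
    results = {}
    for task in tasks:
        for cond in conditions:        # e.g. "standard" vs. a shuffled control
            scores = [ask_model(task, cond) for _ in range(n_repeats)]
            results[(task, cond)] = (mean(scores), scores)  # mean + raw runs
    return results
```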


Planning vs Reasoning: Ablations to Test Capabilities of LoRA layers

Redkar, Neel

arXiv.org Artificial Intelligence

Low-Rank Adaptation (LoRA) layers have emerged as a promising approach for efficient model fine-tuning, but their capabilities and limitations have not been fully explored. This paper (1) investigates the fundamental question of whether LoRA layers are effective at increasing reasoning and planning abilities, and (2) introduces HashChain Reasoning, a novel evaluation dataset that deterministically tests reasoning capabilities. Through systematic ablation studies on GPT-2, we demonstrate that reasoning capabilities appear to exist primarily in low-rank spaces and can be effectively enhanced using LoRA layers. Effective rank analysis of the trained LoRA matrices reveals a 2-3x lower rank requirement for reasoning tasks than for planning tasks, giving context on where LoRA layers would be effective. This also provides evidence that reasoning fundamentally prefers low-parameter spaces for generalization.
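
One common way to measure the effective rank of a trained LoRA update (the product B @ A) is the spectral-entropy definition of Roy and Vetterli (2007), sketched below; the paper may well use a different estimator, so treat this as one plausible reading, with illustrative dimensions.

```python
# Sketch of an entropy-based effective-rank measurement for a LoRA update
# W_delta = B @ A. Assumed estimator, not necessarily the one in the paper.
import numpy as np

def effective_rank(A: np.ndarray, B: np.ndarray) -> float:
    s = np.linalg.svd(B @ A, compute_uv=False)   # singular values of the update
    p = s / s.sum()                              # normalized spectrum
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()             # spectral entropy
    return float(np.exp(entropy))                # effective rank <= nominal rank

rng = np.random.default_rng(0)
r, d = 16, 768                                   # LoRA rank and hidden size (illustrative)
A, B = rng.normal(size=(r, d)), rng.normal(size=(d, r))
print(effective_rank(A, B))                      # lower values = more compressible update
```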


GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

Nasir, Muhammad Umair, James, Steven, Togelius, Julian

arXiv.org Artificial Intelligence

Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question as to what extent and in which way these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through given objectives, with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieved the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three above criteria. Furthermore, we run a preliminary test of large reasoning models, namely o1, which scores 67.84% on GTBS, indicating that the benchmark remains challenging for current models. Code, data, and documentation are available at https://github.com/umair-nasir14/Game-Traversal-Benchmark.
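
The scoring idea the abstract describes, crediting paths that reach the objective in close to the minimum number of steps while penalizing generation errors, can be illustrated with BFS on a grid map. The actual GTB_Score formula is defined in the paper; the weighting below is purely an assumed toy example.

```python
# Illustrative sketch of GTB-style traversal scoring (the real GTB_Score
# formula is in the paper; this weighting is an assumption for illustration).
from collections import deque

def shortest_steps(grid, start, goal):
    """BFS over walkable cells ('.') of a 2D map; returns minimum steps or None."""
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == "." and (nr, nc) not in seen):
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return None  # objective unreachable

def traversal_score(optimal, taken, gen_errors, reached):
    """Toy composite: reward step-optimality, penalize generation errors."""
    if not reached or optimal is None:
        return 0.0
    return max(0.0, optimal / taken - 0.1 * gen_errors)

grid = ["....",
        ".##.",
        "...."]
opt = shortest_steps(grid, (0, 0), (2, 3))       # 5 steps around the wall
print(opt, traversal_score(opt, taken=7, gen_errors=1, reached=True))
```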